Protlearn is a module that applies Machine Learning techniques to extract meaningful information from Multiple Sequence Alignments (MSA) of homologous protein families.
Copyright (C) 2023, Lucas Carrijo de Oliveira (lucas@ebi.ac.uk)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
from protlearn import *
[nltk_data] Downloading package punkt to /Users/lucas/nltk_data... [nltk_data] Package punkt is already up-to-date! [nltk_data] Downloading package stopwords to /Users/lucas/nltk_data... [nltk_data] Package stopwords is already up-to-date! [nltk_data] Downloading package wordnet to /Users/lucas/nltk_data... [nltk_data] Package wordnet is already up-to-date! [nltk_data] Downloading package omw-1.4 to /Users/lucas/nltk_data... [nltk_data] Package omw-1.4 is already up-to-date!
class MSA(pd.DataFrame)
A class for processing and analyzing Multiple Sequence Alignments (MSA).
def __init__(self, msa_file, msa_format="fasta", *args, **kwargs)
Initialize the MSA object.
Arguments:
msa_file str - The path to the MSA file.msa_format str, optional - The format of the MSA file (default is "fasta").*args - Additional positional arguments to pass to SeqIO.parse.**kwargs - Additional keyword arguments to pass to SeqIO.parse.# Create a dummy class `Args` that pretends to be `argparse` so `args` works in this Jupyter notebook.
class Args:
def __init__(self):
self.msa_file = "pf00848-alignment.fasta"
self.metadata_file = "pf00848-metadata.tsv"
self.plot = True # Set to True or False as needed
self.save = True # Set to True or False as needed
self.show = True # Set to True or False as needed
# Create an instance of the Args class
args = Args()
# Create data frame from raw data and clean it
msa = MSA(args.msa_file)
n_rows, n_columns = msa.shape
print(f"Data from {args.msa_file} contains {n_rows} sequences and {n_columns} columns")
Data from pf00848-alignment.fasta contains 9381 sequences and 1225 columns
def map_positions()
Map residue positions in the Multiple Sequence Alignment (MSA) based on sequence headers.
This method calculates the position of each residue in the MSA sequences and stores the mapping in the 'positions_map' attribute of the MSA object. The mapping is based on the sequence headers, which typically include information about the sequence's starting position.
The 'positions_map' is a dictionary with sequence headers as keys and a sub-dictionary as values. The sub-dictionary contains MSA columns (keys) and their corresponding positions (values).
For example: { 'Seq1/1-100': {21: 1, 25: 2, ...}, 'Seq2/171-432': {101: 171, 103: 172, ...}, ... }
This mapping is useful for downstream analysis that requires knowing the position of residues in the MSA.
Notes:
Example:
msa = MSA() msa.parse_msa_file('example.fasta') msa.map_positions()
Access the mapping: positions = msa.positions_map
msa.map_positions()
def cleanse(indel='-', remove_lowercase=True, threshold=.9, plot=False, save=False, show=False)
Cleanse the MSA data by removing columns and rows with missing values.
Arguments:
indel str, optional - The character representing gaps/indels (default is '-').remove_lowercase bool, optional - Whether to remove lowercase characters (default is True).threshold float, optional - The threshold for missing values (default is 0.9).plot bool, optional - Whether to plot a heatmap of missing values (default is False).msa.cleanse(plot=args.plot, show=args.show)
print(f"Before removing null values: {msa.dirty.shape}")
print(f"After removing null values: {msa.clean.shape}")
print(f"After removing duplicates: {msa.unique_sequences.shape}")
Before removing null values: (9381, 1225) After removing null values: (7568, 41) After removing duplicates: (6469, 41)
def reduce(plot=False, *args, **kwargs)
Perform Multidimensional Correspondence Analysis (MCA) on the MSA data to reduce dimensionality.
Arguments:
plot bool, optional - Whether to plot the results (default is False).
*args, **kwargs: Additional arguments and keyword arguments for the MCA.Notes:
Example:
msa = MSA('example.fasta') msa.map_positions() msa.cleanse() msa.reduce(plot=True)
msa.reduce(n_components=2, plot=args.plot)
print(f"Dimensions reduced from {msa.unique_sequences.shape} to {msa.coordinates.shape}")
Dimensions reduced from (6469, 41) to (6469, 2)
def cluster_sequences(min_clusters=2,
max_clusters=10,
method='single-linkage',
plot=False, save=False, show=False)
Cluster the MSA data and obtain cluster labels.
Arguments:
min_clusters int, optional - Minimum number of clusters (default is 2).max_clusters int, optional - Maximum number of clusters (default is 10).method str, optional - Clustering method ('k-means' or 'single-linkage') (default is 'single-linkage').plot bool, optional - Whether to plot the clustering results (default is False).Notes:
Example:
msa = MSA('example.fasta') msa.map_positions() msa.cleanse() msa.cluster_sequences(method='single-linkage', min_clusters=3, plot=True)
msa.cluster_sequences(method='single-linkage', min_clusters=3)
def generate_wordclouds(path_to_metadata=None,
column='Protein names',
plot=False, save=False, show=False)
Generate word cloud visualizations from protein names in a DataFrame.
Arguments:
path_to_metadata str, default=None - Path to metadata file in tsv format.column str, default='Protein names' - The name of the column in the DataFrame containing protein names.plot bool, default=False - Whether to plot word clouds.
This method extracts substrate and enzyme names from the specified column using regular expressions, normalizes the names, and creates word cloud plots for each cluster of sequences.
Example:
msa = MSA('example.fasta') msa.map_positions() msa.cleanse() msa.cluster_sequences(method='single-linkage', min_clusters=3) msa.generate_wordclouds(path_to_metadata='metadata.tsv', plot=True)
msa.generate_wordclouds(path_to_metadata=args.metadata_file, plot=args.plot, show=args.show)
def select_features(n_estimators=None, random_state=None, plot=False, save=False, show=False)
Select important features (residues) from the MSA data.
Arguments:
n_estimators int, optional - Parameter n_estimators for RandomForest.random_state - (int, optional): Parameter random_state for RandomForest.plot bool, optional - Whether to plot feature selection results (default is False).msa.select_features(n_estimators=1000, random_state=42, plot=args.plot, show=args.show)
def select_residues(threshold=0.9, top_n=None, plot=False, save=False, show=False)
Select and store residues to be candidates for Specificity-determining Positions (SDPs) from the MSA data.
Arguments:
threshold float, optional - The threshold for selecting residues based on importance (default is 0.9).top_n int, optional - The top N residues to select based on importance (default is None).plot bool, optional - Whether to plot the selected residues (default is False).msa.select_residues(top_n=3, plot=args.plot, show=args.show)
def generate_logos(plot=False, save=False, show=False)
Generate logos from the MSA data for each cluster label.
Arguments:
plot bool, optional - Whether to plot the generated logos (default is False).msa.generate_logos(plot=args.plot, show=args.show)
Warning: Character '-' is not in color_dict. Using black. Warning: Character '-' is not in color_dict. Using black.
Attention! The following snippets are not implemented in protlearn module. From now on, these are manual steps specific to this case study.
The following code snippet filters sequences from the msa.profiles DataFrame based on whether their header contains a known structure from the known_structure list. Sequences with headers matching those in the known_structure list are retained in the filtered DataFrame.
known_structure = [
'TPDA2_COMSP',
'BPHA1_RHOJR',
'NDOB_PSEU8',
'NDOB_PSEPU',
'NAGG_RALSP',
'CNTA_ACIB2',
'BNZA_PSEP1',
'BPHA_PARXL',
'BPHA_COMTE',
]
have_structure = []
# Iterate through headers and filter sequences with known structures
for header in msa.profiles.index:
if header.split('/')[0] in known_structure:
have_structure.append(header)
# Filter and display the sequences with known structures
filtered_sequences = msa.profiles[msa.profiles.index.isin(have_structure)]
filtered_sequences
| 200 | 784 | 982 | |
|---|---|---|---|
| BPHA_PARXL/183-451 | Ser229 | Arg340 | Asp387 |
| BPHA1_RHOJR/174-441 | Ser220 | Arg330 | Asp377 |
In the following code, we use the nglview library to visualize a protein structure loaded from the Protein Data Bank (PDB). We load the PDB structure with the ID '1ULI,' which corresponds to 'BPHA1_RHOJR/174-441.'
```python
import nglview as nv
pdb_id = '1ULI' view = nv.show_pdbid(pdb_id) view.update_representation(opacity=0.5)
sdps, chains = (220, 330, 377), ('A', 'C', 'E') selection_str = ":{} and ( {} or {} or {} )"
for chain in chains:
view.add_representation('ball+stick', selection=selection_str.format(chain, *sdps), color='black')
view.add_representation('label', selection=selection_str.format(chain, *sdps), color='red')
view
from IPython.display import display, Image
# Load an image from a file
image_path = "./output/molecule-zoom-out.png" # Replace with the path to your image file
image = Image(filename=image_path)
# Display the image
display(image)